BACK

Statistical Inference Course Notes

Overview

  • Statistical Inference = generating conclusions about a population from a noisy sample
  • Goal = extend beyond data to population
  • Statistical Inference = only formal system of inference we have
  • many different modes, but two broad flavors of inference (inferential paradigms): Bayesian vs Frequencist
    • Frequencist = uses long run proportion of times an event occurs independent identically distributed repetitions
      • frequentist is what this class is focused on
      • believes if an experiment is repeated many many times, the resultant percentage of success/something happening defines that population parameter
    • Bayesian = probability estimate for a hypothesis is updated as additional evidence is acquired
  • statistic = number computed from a sample of data
    • statistics are used to infer information about a population
  • random variable = outcome from an experiment
    • deterministic processes (variance/means) produce additional random variables when applied to random variables, and they have their own distributions

Probability

  • Probability = the study of quantifying the likelihood of particular events occurring
    • given a random experiment, probability = population quantity that summarizes the randomness
      • not in the data at hand, but a conceptual quantity that exist in the population that we want to estimate

General Probability Rules

  • discovered by Russian mathematician Kolmogorov, also known as “Probability Calculus”
  • probability = function of any set of outcomes and assigns it a number between 0 and 1
    • \(0 \le P(E) \le 1\), where \(E\) = event
  • probability that nothing occurs = 0 (impossible, have to roll dice to create outcome), that something occurs is 1 (certain)
  • probability of outcome or event \(E\), \(P(E)\) = ratio of ways that \(E\) could occur to number of all possible outcomes or events
  • probability of something = 1 - probability of the opposite occurring
  • probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) = sum of respective probabilities

  • if \(A\) implies occurrence of \(B\), then \(P(A)\) occurring \(< P(B)\) occurring

  • for any two events, probability of at least one occurs = the sum of their probabilities - their intersection (in other words, probabilities can not be added simply if they have non-trivial intersection)

  • for independent events \(A\) and \(B\), \(P(A \:\cup\: B) = P(A) \times P(B)\)
  • for outcomes that can occur with different combination of events and these combinations are mutually exclusive, the \(P(E_{total}) = \sum P(E_{part})\)

Conditional Probability

  • let B = an event so that \(P(B) > 0\)
  • conditional probability of an event A, given B is defined as the probability that BOTH A and B occurring divided by the probability of B occurring \[P(A\:|\:B) = \frac{P(A \:\cap\: B)}{P(B)}\]
  • if A and B are independent, then \[P(A\:|\:B) = \frac{P(A)P(B)}{P(B)} = P(A)\]
  • example
    • for die roll, \(A = \{1\}\), \(B = \{1, 3, 5\}\), then \[P(1~|~Odd) = P(A\:|\:B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3}\]

Baye’s Rule

  • definition \[P(B\:|\:A) = \frac{P(A\:|\:B)P(B)}{P(A\:|\:B)P(B)+P(A\:|\:B^c)P(B^c)}\] where \(B^c\) = corresponding probability of event \(B\), \(P(B^c) = 1 - P(B)\)

Random Variables

  • random variable = numeric outcome of experiment
  • discrete (what you can count/categories) = assign probabilities to every number/value the variable can take
    • coin flip, rolling a die, web traffic in a day
  • continuous (any number within a continuum) = assign probabilities to the range the variable can take
    • BMI index, intelligence quotients
    • Note: limitations of precision in taking the measurements may imply that the values are discrete, but we in fact consider them continuous
  • rbinom(), rnorm(), rgamma(), rpois(), runif() = functions to generate random variables from the binomial, normal, Gamma, Poisson, and uniform distributions
  • density and mass functions (population quantities, not what occurs in data) for random variables = best starting point to model/think about probabilities for numeric outcome of experiments (variables)
    • use data to estimate properties of population \(\rightarrow\) linking sample to population

Probability Mass Function (PMF)

  • evaluates the probability that the discrete random variable takes on a specific value
    • measures the chance of a particular outcome happening
    • always \(\ge\) 0 for every possible outcome
    • \(\sum\) possible values that the variable can take = 1
  • Bernoulli distribution example
    • X = 0 \(\rightarrow\) tails, X = 1 \(\rightarrow\) heads
      • X here represents potential outcome
    • \(P(X = x) = (\frac{1}{2})^x(\frac{1}{2})^{1-x}\) for \(X = 0, 1\)
      • \(x\) here represents a value we can plug into the PMF
      • general form \(\rightarrow\) \(p(x) = (\theta)^x(1-\theta)^{1-x}\)
  • dbinom(k, n, p) = return the probability of getting k successes out of n trials, given probability of success is p

Probability Density Function (PDF)

  • evaluates the probability that the continuous random variable takes on a specific value
    • always \(\ge\) 0 everywhere
    • total area under curve must = 1
  • areas under PDFs correspond to the probabilities for that random variable taking on that range of values (PMF)

  • but the probability of the variable taking a specific value = 0 (area of a line is 0)

* Note: the above is true because it is modeling random variables as if they have infinite precision, when in reality they do not

  • dnorm(), dgamma(), dpois(), dunif() = return probability of a certain value from the normal, Gamma, Poisson, and uniform distributions

Cumulative Distribution Function (CDF)

  • CDF of a random variable \(X\) = probability that the random variable is \(\le\) value \(x\)
    • \(F(x) = P(X \le x)\) = applies when \(X\) is discrete/continuous
  • PDF = derivative of CDF
    • integrate PDF \(\rightarrow\) CDF
      • integrate(function, lower=0, upper=1) \(\rightarrow\) can be used to evaluate integrals for a specified range
  • pbinom(), pnorm(), pgamma(), ppois(), punif() = returns the cumulative probabilities from 0 up to a specified value from the binomial, normal, Gamma, Poisson, and uniform distributions

Survival Function

  • survival function of a random variable \(X\) = probability the random variable \(> x\), complement of CDF
    • \(S(x) = P(X > x) = 1 - F(x)\), where \(F(x) =\) CDF

Quantile

  • the \(\alpha^{th}\) quantile of a distribution with distribution function F = point \(x_{\alpha}\)
    • \(F(x_{\alpha}) = \alpha\)
    • percentile = quantile with \(\alpha\) expressed as a percent
    • median = 50th percentile
    • \(\alpha\)% of the possible outcomes lie below it

  • qbeta(quantileInDecimals, 2, 1) = returns quantiles for beta distribution
    • works for qnorm(), qbinom(), qgamma(), qpois(), etc.
  • median estimated in this fashion = a population median
  • probability model connects data to population using assumptions
    • population median = estimand, sample median = estimator

Independence

  • two events \(A\) and \(B\) are independent if the following is true
    • \(P(A\:\cap\:B) = P(A)P(B)\)
    • \(P(A\:|\:B) = P(A)\)
  • two random variables \(X\) and \(Y\) are independent, if for any two sets, A and B, the following is true
    • \(P([X \in A]\cap[Y \in B]) = P(X \in A)P(Y \in B)\)
  • independence = statistically unrelated from one another
  • if \(A\) is independent of \(B\), then the following are true
    • \(A^c\) is independent of \(B\)
    • \(A\) is independent of \(B^c\)
    • \(A^c\) is independent of \(B^c\)

IID Random Variables

  • random variables are said to be IID if they are independent and identically distributed
    • independent = statistically unrelated from each other
    • identically distributed = all having been drawn from the same population distribution
  • IID random variables = default model for random samples = default starting point of inference

Diagnostic Test

  • Let \(+\) and \(-\) be the results, positive and negative respectively, of a diagnostic test
  • Let \(D\) = subject of the test has the disease, \(D^c\) = subject does not
  • sensitivity = \(P(+\:|\:D)\) = probability that the test is positive given that the subject has the disease (the higher the better)
  • specificity = \(P(-\:|\:D^c)\) = probability that the test is negative given that the subject does not have the disease (the higher the better)
  • positive predictive value = \(P(D\:|\:+)\) = probability that that subject has the disease given that the test is positive
  • negative predictive value = \(P(D^c\:|\:-)\) = probability that the subject does not have the disease given the test is negative
  • prevalence of disease = \(P(D)\) = marginal probability of disease

Example

  • specificity of 98.5%, sensitivity = 99.7%, prevalence of disease = .1% \[\begin{aligned} P(D ~|~ +) & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ & = \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ & = .062 \end{aligned}\]
  • low positive predictive value \(\rightarrow\) due to low prevalence of disease and somewhat modest specificity
    • suppose it was know that the subject uses drugs and has regular intercourse with an HIV infect partner (his probability of being + is higher than suspected)
    • evidence implied by a positive test result

Likelihood Ratios

  • diagnostic likelihood ratio of a positive test result is defined as \[DLR_+ = \frac{sensitivity}{1-specificity} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)}\]
  • diagnostic likelihood ratio of a negative test result is defined as \[DLR_- = \frac{1 - sensitivity}{specificity} = \frac{P(-\:|\:D)}{P(-\:|\:D^c)}\]
  • from Baye’s Rules, we can derive the positive predictive value and false positive value \[P(D\:|\:+) = \frac{P(+\:|\:D)P(D)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(1)}\] \[P(D^c\:|\:+) = \frac{P(+\:|\:D^c)P(D^c)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(2)}\]
  • if we divide equation \((1)\) over \((2)\), the quantities over have the same denominator so we get the following \[\frac{P(D\:|\:+)}{P(D^c\:|\:+)} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)} \times \frac{P(D)}{P(D^c)}\] which can also be written as \[\mbox{post-test odds of D} = DLR_+ \times \mbox{pre-test odds of D}\]
    • odds = \(p/(1-p)\)
    • \(\frac{P(D)}{P(D^c)}\) = pre-test odds, or odds of disease in absence of test
    • \(\frac{P(D\:|\:+)}{P(+\:|\:D^c)}\) = post-test odds, or odds of disease given a positive test result
    • \(DLR_+\) = factor by which the odds in the presence of a positive test can be multiplied to obtain the post-test odds
    • \(DLR_-\) = relates the decrease in odds of disease after a negative result
  • following the previous example, for sensitivity of 0.997 and specificity of 0.985, so the diagnostic likelihood ratios are as follows \[DLR_+ = .997/(1-.985) = 66 ~~~~~~ DLR_- =(1-.997)/.985 = 0.003\]
    • this indicates that the result of the positive test is the odds of disease is 66 times the pretest odds

Expected Values/Mean

  • useful for characterizing a distribution (properties of distributions)
  • mean = characterization of the center of the distribution = expected value
  • expected value operation = linear \(\rightarrow\) \(E(aX +bY) = aE(X) + bE(Y)\)
  • variance/standard deviation = characterization of how spread out the distribution is
  • sample expected values for sample mean and variance will estimate the population counterparts

  • population mean
    • expected value/mean of a random variable = center of its distribution (center of mass)
    • discrete variables
      • for \(X\) with PMF \(p(x)\), the population mean is defined as \[E[X] = \sum_{x} xp(x)\] where the sum is taken over all possible values of \(x\)
      • \(E[X]\) = center of mass of a collection of location and weights \({x,~p(x)}\)
      • coin flip example: \(E[X] = 0 \times (1-p) + 1 \times p = p\)
    • continuous variable
      • for \(X\) with PDF \(f(x)\), the expected value = the center of mass of the density
      • instead of summing over discrete values, the expectation integrates over a continuous function
        • PDF = \(f(x)\)
        • \(\int xf(x)\) = area under the PDF curve = mean/expected value of \(X\)
  • sample mean
    • sample mean estimates the population mean
      • sample mean = center of mass of observed data = empirical mean \[\bar X = \sum_{x}^n x_i p(x_i)\] where \(p(x_i) = 1/n\)
# load relevant packages
library(UsingR); data(galton); library(ggplot2)
# plot galton data
g <- ggplot(galton, aes(x = child))
# add histogram for children data
g <- g + geom_histogram(fill = "salmon", binwidth=1, aes(y=..density..), colour="black")
# add density smooth
g <- g + geom_density(size = 2)
# add vertical line
g <- g + geom_vline(xintercept = mean(galton$child), size = 2)
# print graph
g

  • average of random variables = a new random variable where its distribution has an expected value that is the same as the original distribution (centers are the same)
    • the mean of the averages = average of the original data \(\rightarrow\) estimates average of the population
    • if \(E[sample~mean]\) = population mean, then estimator for the sample mean is unbiased
      • [derivation] let \(X_1\), \(X_2\), \(X_3\), … \(X_n\) be a collection of \(n\) samples from the population with mean \(\mu\)
      • mean of this sample \[\bar X = \frac{X_1 + X_2 + X_3 + . + X_n}{n}\]
      • since \(E(aX) = aE(X)\), the expected value of the mean is can be written as \[E\left[\frac{X_1 + X_2 + X_3 + ... + X_n}{n}\right] = \frac{1}{n} \times [E(X_1) + E(X_2) + E(X_3) + ... + E(X_n)]\]
      • since each of the \(E(X_i)\) is drawn from the population with mean \(\mu\), the expected value of each sample should be \[E(X_i) = \mu\]
      • therefore \[\begin{aligned} E\left[\frac{X_1 + X_2 + X_3 + ... + X_n}{n}\right] & = \frac{1}{n} \times [E(X_1) + E(X_2) + E(X_3) + ... + E(X_n)]\\ & = \frac{1}{n} \times [\mu + \mu + \mu + ... + \mu]\\ & = \frac{1}{n} \times n \times \mu\\ & = \mu\\ \end{aligned}\]
  • Note: the more data that goes into the sample mean, the more concentrated its density/mass functions are around the population mean
nosim <- 1000
# simulate data for sample size 1 to 4
dat <- data.frame(
    x = c(sample(1 : 6, nosim, replace = TRUE),
        apply(matrix(sample(1 : 6, nosim * 2, replace = TRUE), nosim), 1, mean),
        apply(matrix(sample(1 : 6, nosim * 3, replace = TRUE), nosim), 1, mean),
        apply(matrix(sample(1 : 6, nosim * 4, replace = TRUE), nosim), 1, mean)),
    size = factor(rep(1 : 4, rep(nosim, 4))))
# plot histograms of means by sample size
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth=.25, colour = "black")
g + facet_grid(. ~ size)

Variance

# generate x value ranges
xvals <- seq(-10, 10, by = .01)
# generate data from normal distribution for sd of 1 to 4
dat <- data.frame(
    y = c(dnorm(xvals, mean = 0, sd = 1),
        dnorm(xvals, mean = 0, sd = 2),
        dnorm(xvals, mean = 0, sd = 3),
        dnorm(xvals, mean = 0, sd = 4)),
    x = rep(xvals, 4),
    factor = factor(rep(1 : 4, rep(length(xvals), 4)))
)
# plot 4 lines for the different standard deviations
ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)

  • variance = measure of spread or dispersion, the expected squared distance of the variable from its mean (expressed in \(X\)’s units\(^2\))
    • as we can see from above, higher variances \(\rightarrow\) more spread, lower \(\rightarrow\) smaller spread
    • \(Var(X) = E[(X-\mu)^2] = E[X^2] - E[X]^2\)
    • standard deviation \(= \sqrt{Var(X)}\) \(\rightarrow\) has same units as X
    • example
      • for die roll, \(E[X] = 3.5\)
      • \(E[X^2] = 1^2 \times 1/6 + 2^2 \times 1/6 + 3^2 \times 1/6 + 4^2 \times 1/6 + 5^2 \times 1/6 + 6^2 \times 1/6 = 15.17\)
      • \(Var(X) = E[X^2] - E[X]^2 \approx 2.92\)
    • example
      • for coin flip, \(E[X] = p\)
      • \(E[X^2] = 0^2 \times (1 - p) + 1^2 \times p= p\)
      • \(Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1-p)\)

Sample Variance

  • the sample variance is defined as \[S^2 = \frac{\sum_{i=1} (X_i - \bar X)^2}{n-1}\]

  • on the above line representing the population (in magenta), any subset of data (3 of 14 selected, marked in blue) will most likely have a variance that is lower than the population variance
  • dividing by \(n - 1\) will make the variance estimator larger to adjust for this fact \(\rightarrow\) leads to more accurate estimation \(\rightarrow\) \(S^2\) = so called unbiased estimate of population variance
    • \(S^2\) is a random variable, and therefore has an associated population distribution
      • \(E[S^2]\) = population variance, where \(S\) = sample standard deviation
      • as we see from the simulation results below, with more data, the distribution for \(S^2\) gets more concentrated around population variance
# specify number of simulations
nosim <- 10000;
# simulate data for various sample sizes
dat <- data.frame(
    x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
          apply(matrix(rnorm(nosim * 20), nosim), 1, var),
          apply(matrix(rnorm(nosim * 30), nosim), 1, var)),
    n = factor(rep(c("10", "20", "30"), c(nosim, nosim, nosim))) )
# plot density function for different sample size data
ggplot(dat, aes(x = x, fill = n)) + geom_density(size = 1, alpha = .2) +
    geom_vline(xintercept = 1, size = 1)

  • Note: for any variable, properties of the population = parameter, estimates of properties for samples = statistic
    • below is a summary for the mean and variance for population and sample

  • distribution for mean of random samples
    • expected value of the mean of distribution of means = expected value of the sample mean = population mean
      • \(E[\bar X]=\mu\)
    • expected value of the variance of distribution of means
      • \(Var(\bar X) = \sigma^2/n\)
      • as n becomes larger, the mean of random sample \(\rightarrow\) more concentrated around the population mean \(\rightarrow\) variance approaches 0
        • this again confirms that sample mean estimates population mean
    • Note: normally we only have 1 sample mean (from collected sample) and can estimate the variance \(\sigma^2\) \(\rightarrow\) so we know a lot about the distribution of the means from the data observed
  • standard error (SE)
    • the standard error of the mean is defined as \[SE_{mean} = \sigma/\sqrt{n}\]
    • this quantity is effectively the standard deviation of the distribution of a statistic (i.e. mean)
    • represents variability of means

Entire Estimator-Estimation Relationship

  • Start with a sample
  • \(S^2\) = sample variance
    • estimates how variable the population is
    • estimates population variance \(\sigma^2\)
    • \(S^2\) = a random variable and has its own distribution centered around \(\sigma^2\)
      • more concentrated around \(\sigma^2\) as \(n\) increases
  • \(\bar X\) = sample mean
    • estimates population mean \(\mu\)
    • \(\bar X\) = a random variable and has its own distribution centered around \(\mu\)
      • more concentrated around \(\mu\) as \(n\) increases
      • variance of distribution of \(\bar X = \sigma^2/n\)
      • estimate of variance = \(S^2/n\)
      • estimate of standard error = \(S/\sqrt{n}\) \(\rightarrow\) “sample standard error of the mean”
        • estimates how variable sample means (\(n\) size) from the population are

Example - Standard Normal

  • variance = 1
  • means of n standard normals (sample) have standard deviation = \(1/\sqrt{n}\)
# specify number of simulations with 10 as number of observations per sample
nosim <- 1000; n <-10
# estimated standard deviation of mean
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
## [1] 0.309473
# actual standard deviation of mean of standard normals
1 / sqrt(n)
## [1] 0.3162278
  • rnorm() = generate samples from the standard normal
  • matrix() = puts all samples into a nosim by \(n\) matrix, so that each row represents a simulation with nosim observations
  • apply() = calculates the mean of the \(n\) samples
  • sd() = returns standard deviation

Example - Standard Uniform

  • standard uniform \(\rightarrow\) triangle straight line distribution \(\rightarrow\) mean = 1/2 and variance = 1/12
  • means of random samples of \(n\) uniforms have have standard deviation of \(1/\sqrt{12 \times n}\)
# estimated standard deviation of the sample means
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
## [1] 0.09349955
# actual standard deviation of the means
1/sqrt(12*n)
## [1] 0.09128709

Example - Poisson

  • \(Poisson(x^2)\) have variance of \(x^2\)
  • means of random samples of \(n~ Poisson(4)\) have standard deviation of \(2/\sqrt{n}\)
# estimated standard deviation of the sample means
sd(apply(matrix(rpois(nosim * n, lambda=4), nosim), 1, mean))
## [1] 0.6393796
# actual standard deviation of the means
2/sqrt(n)
## [1] 0.6324555

Example - Bernoulli

  • for \(p = 0.5\), the Bernoulli distribution has variance of 0.25
  • means of random samples of \(n\) coin flips have standard deviations of \(1 / (2 \sqrt{n})\)
# estimated standard deviation of the sample means
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE), nosim), 1, mean))
## [1] 0.1579688
# actual standard deviation of the means
1/(2*sqrt(n))
## [1] 0.1581139

Example - Father/Son

# load data
library(UsingR); data(father.son);
# define son height as the x variable
x <- father.son$sheight
# n is the length
n<-length(x)
# plot histogram for son's heights
g <- ggplot(data = father.son, aes(x = sheight))
g <- g + geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth=1, colour = "black")
g <- g + geom_density(size = 2, colour = "black")
g

# we calculate the parameters for variance of distribution and sample mean,
round(c(sampleVar = var(x),
    sampleMeanVar = var(x) / n,
    # as well as standard deviation of distribution and sample mean
    sampleSd = sd(x),
    sampleMeanSd = sd(x) / sqrt(n)),2)
##     sampleVar sampleMeanVar      sampleSd  sampleMeanSd
##          7.92          0.01          2.81          0.09

Binomial Distribution

  • binomial random variable = sum of n Bernoulli variables \[X = \sum_{i=1}^n X_i\] where \(X_1,\ldots,X_n = Bernoulli(p)\)
    • PMF is defined as \[P(X=x) = {n \choose x}p^x(1-p)^{n-x}\] where \({n \choose x}\) = number of ways selecting \(x\) items out of \(n\) options without replacement or regard to order and for \(x=0,\ldots,n\)
    • combination or “\(n\) choose \(x\)” is defined as \[{n \choose x} = \frac{n!}{x!(n-x)!}\]
    • the base cases are \[{n \choose n} = {n \choose 0} = 1\]
  • Bernoulli distribution = binary outcome
    • only possible outcomes
      • 1 = “success” with probability of \(p\)
      • 0 = “failure” with probability of \(1 - p\)
    • PMF is defined as \[P(X=x) = p^x(1 - p)^{1-x}\]
    • mean = \(p\)
    • variance = \(p(1 - p)\)

Example

  • of 8 children, whats the probability of 7 or more girls (50/50 chance)? \[{8 \choose 7}.5^7(1-.5)^{1} + {8 \choose 8}.5^8(1-.5)^{0} \approx 0.04\]
# calculate probability using PMF
choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8
## [1] 0.03515625
# calculate probability using CMF from distribution
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
## [1] 0.03515625
  • choose(8, 7) = R function to calculate n choose x
  • pbinom(6, size=8, prob =0.5, lower.tail=TRUE) = probability of 6 or less successes out of 8 samples with probability of 0.5 (CMF)
    • lower.tail=FALSE = returns the complement, in this case it’s the probability of greater than 6 successes out of 8 samples with probability of 0.5

Normal Distribution

  • normal/Gaussian distribution for random variable X
    • notation = \(X \sim N(\mu, \sigma^2)\)
    • mean = \(E[X] = \mu\)
    • variance = \(Var(X) = \sigma^2\)
    • PMF is defined as \[f(x)=(2\pi\sigma^2)^{-1/2}e^{-(x-\mu)^2/2\sigma^2}\]
  • \(X \sim N(0, 1)\) = standard normal distribution (standard normal random variables often denoted using \(Z_1, Z_2, \ldots\))
    • Note: see below graph for reference for the following observations
    • ~68% of data/normal density \(\rightarrow\) between \(\pm\) 1 standard deviation from \(\mu\)
    • ~95% of data/normal density \(\rightarrow\) between \(\pm\) 2 standard deviation from \(\mu\)
    • ~99% of data/normal density \(\rightarrow\) between \(\pm\) 3 standard deviation from \(\mu\)
    • \(\pm\) 1.28 standard deviations from \(\mu\) \(\rightarrow\) 10\(^{th}\) (-) and 90\(^{th}\) (+) percentiles
    • \(\pm\) 1.645 standard deviations from \(\mu\) \(\rightarrow\) 5\(^{th}\) (-) and 95\(^{th}\) (+) percentiles
    • \(\pm\) 1.96 standard deviations from \(\mu\) \(\rightarrow\) 2.5\(^{th}\) (-) and 97.5\(^{th}\) (+) percentiles
    • \(\pm\) 2.33 standard deviations from \(\mu\) \(\rightarrow\) 1\(^{st}\) (-) and 99\(^{th}\) (+) percentiles
# plot standard normal
x <- seq(-3, 3, length = 1000)
g <- ggplot(data.frame(x = x, y = dnorm(x)),
            aes(x = x, y = y)) + geom_line(size = 2)
g <- g + geom_vline(xintercept = -3 : 3, size = 2)
g

  • for any \(X \sim N(\mu, \sigma^2)\), calculating the number of standard deviations each observation is from the mean converts the random variable to a standard normal (denoted as \(Z\) below) \[Z=\frac{X-\mu}{\sigma} \sim N(0,1)\]
  • conversely, a standard normal can then be converted to any normal distribution by multiplying by standard deviation and adding the mean \[X = \mu + \sigma Z \sim N(\mu, \sigma^2)\]
  • qnorm(n, mean=mu, sd=sd) = returns the \(n^{th}\) percentiles for the given normal distribution
  • pnorm(x, mean=mu, sd=sd, lower.tail=F) = returns the probability of an observation drawn from the given distribution is larger in value than the specified threshold \(x\)

Example

  • the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50
  • What’s the probability of getting more than 1,160 clicks in a day?
# calculate number of standard deviations from the mean
(1160 - 1020) / 50
## [1] 2.8
# calculate probability using given distribution
pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
## [1] 0.00255513
# calculate probability using standard normal
pnorm(2.8, lower.tail = FALSE)
## [1] 0.00255513
  • therefore, it is not very likely (0.255513% chance), since 1,160 is 2.8 standard deviations from the mean
  • What number of daily ad clicks would represent the one where 75% of days have fewer clicks (assuming days are independent and identically distributed)?
qnorm(0.75, mean = 1020, sd = 50)
## [1] 1053.724
  • therefore, 1053.7244875 would represent the threshold that has more clicks than 75% of days

Poisson Distribution

  • used to model counts
    • mean = \(\lambda\)
    • variance = \(\lambda\)
    • PMF is defined as \[P(X = x; \lambda)=\frac{\lambda^xe^{-\lambda}}{x!}\] where \(X = 0, 1, 2, ... \infty\)
  • modeling uses for Poisson distribution
    • count data
    • event-time/survival \(\rightarrow\) cancer trials, some patients never develop and some do, dealing with the data for both (“censoring”)
    • contingency tables \(\rightarrow\) record results for different characteristic measurements
    • approximating binomials \(\rightarrow\) instances where n is large and p is small (i.e. pollution on lung disease)
      • \(X \sim Binomial(n, p)\)
      • \(\lambda = np\)
    • rates \(\rightarrow\) \(X \sim Poisson(\lambda t)\)
      • \(\lambda = E[X/t]\) \(\rightarrow\) expected count per unit of time
      • \(t\) = total monitoring time
  • ppois(n, lambda = lambda*t) = returns probability of \(n\) or fewer events happening given the rate \(\lambda\) and time \(t\)

Example

  • number of people that show up at a bus stop can be modeled with Poisson distribution with a mean of 2.5 per hour
  • after watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole time?
# calculate using distribution
ppois(3, lambda = 2.5 * 4)
## [1] 0.01033605
  • as we can see from above, there is a 1.0336051% chance for 3 or fewer people show up total at the bus stop during 4 hours of monitoring

Example - Approximating Binomial Distribution

  • flip a coin with success probability of 0.01 a total 500 times (low \(p\), large \(n\))
  • what’s the probability of 2 or fewer successes?
# calculate correct probability from Binomial distribution
pbinom(2, size = 500, prob = .01)
## [1] 0.1233858
# estimate probability using Poisson distribution
ppois(2, lambda=500 * .01)
## [1] 0.124652
  • as we can see from above, the two probabilities (12.3385774% vs 12.3385774%) are extremely close

Asymptotics

  • asymptotics = behavior of statistics as sample size \(\rightarrow\) \(\infty\)
  • useful for simple statistical inference/approximations
  • form basis for frequentist interpretation of probabilities (“Law of Large Numbers”)

Law of Large Numbers (LLN)

  • IID sample statistic that estimates property of the sample (i.e. mean, variance) becomes the population statistic (i.e. population mean, population variance) as \(n\) increases
  • Note: an estimator is consistent if it converges to what it is estimating
  • sample mean/variance/standard deviation are all consistent estimators for their population counterparts
    • \(\bar X_n\) is average of the result of \(n\) coin flips (i.e. the sample proportion of heads)
    • as we flip a fair coin over and over, it eventually converges to the true probability of a head

Example - LLN for Normal and Bernoulli Distribution

  • for this example, we will simulate 10000 samples from the normal and Bernoulli distributions respectively
  • we will plot the distribution of sample means as \(n\) increases and compare it to the population means
# load library
library(gridExtra)
# specify number of trials
n <- 10000
# calculate sample (from normal distribution) means for different size of n
means <- cumsum(rnorm(n)) / (1  : n)
# plot sample size vs sample mean
g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g <- g + ggtitle("Normal Distribution")
# calculate sample (coin flips) means for different size of n
means <- cumsum(sample(0 : 1, n , replace = TRUE)) / (1  : n)
# plot sample size vs sample mean
p <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
p <- p + geom_hline(yintercept = 0.5) + geom_line(size = 2)
p <- p + labs(x = "Number of obs", y = "Cumulative mean")
p <- p + ggtitle("Bernoulli Distribution (Coin Flip)")
# combine plots
grid.arrange(g, p, ncol = 2)

  • as we can see from above, for both distributions the sample means undeniably approach the respective population means as \(n\) increases

Central Limit Theorem

  • one of the most important theorems in statistics
  • distribution of means of IID variables approaches the standard normal as sample size \(n\) increases
  • in other words, for large values of \(n\), \[\frac{\mbox{Estimate} - \mbox{Mean of Estimate}}{\mbox{Std. Err. of Estimate}} = \frac{\bar X_n - \mu}{\sigma / \sqrt{n}}=\frac{\sqrt n (\bar X_n - \mu)}{\sigma} \longrightarrow N(0, 1)\]
  • this translates to the distribution of the sample mean \(\bar X_n\) is approximately \(N(\mu, \sigma^2/n)\)
    • distribution is centered at the population mean
    • with standard deviation = standard error of the mean
  • typically the Central Limit Theorem can be applied when \(n \geq 30\)

Example - CLT with Bernoulli Trials (Coin Flips)

  • for this example, we will simulate \(n\) flips of a possibly unfair coin
    • let \(X_i\) be the 0 or 1 result of the \(i^{th}\) flip of a possibly unfair coin
    • sample proportion , \(\hat p\), is the average of the coin flips
    • \(E[X_i] = p\) and \(Var(X_i) = p(1-p)\)
    • standard error of the mean is \(SE = \sqrt{p(1-p)/n}\)
  • in principle, normalizing the random variable \(X_i\), we should get an approximately standard normal distribution \[\frac{\hat p - p}{\sqrt{p(1-p)/n}} \sim N(0,~1)\]
  • therefore, we will flip a coin \(n\) times, take the sample proportion of heads (successes with probability \(p\)), subtract off 0.5 (ideal sample proportion) and multiply the result by \(\frac{1}{2 \sqrt{n}}\) and compare it to the standard normal

  • now, we can run the same simulation trials for an extremely unfair coin with \(p\) = 0.9

  • as we can see from both simulations, the converted/standardized distribution of the samples convert to the standard normal distribution
  • Note: speed at which the normalized coin flips converge to normal distribution depends on how biased the coin is (value of \(p\))
  • Note: does not guarantee that the normal distribution will be a good approximation, but just that eventually it will be a good approximation as n \(\rightarrow \infty\)

Confidence Intervals - Normal Distribution/Z Intervals

  • Z confidence interval is defined as \[Estimate \pm ZQ \times SE_{Estimate}\] where \(ZQ\) = quantile from the standard normal distribution
  • according to CLT, the sample mean, \(\bar X\), is approximately normal with mean \(\mu\) and sd \(\sigma / \sqrt{n}\)
  • 95% confidence interval for the population mean \(\mu\) is defined as \[\bar X \pm 2\sigma/\sqrt{n}\] for the sample mean \(\bar X \sim N(\mu, \sigma^2/n)\)
    • you can choose to use 1.96 to be more accurate for the confidence interval
    • \(P(\bar{X} > \mu + 2\sigma/\sqrt{n}~or~\bar{X} < \mu - 2\sigma/\sqrt{n}) = 5\%\)
    • interpretation: if we were to repeatedly draw samples of size \(n\) from the population and construct this confidence interval for each case, approximately 95% of the intervals will contain \(\mu\)
  • confidence intervals get narrower with less variability or larger sample sizes
  • Note: Poisson and binomial distributions have exact intervals that don’t require CLT
  • example
    • for this example, we will compute the 95% confidence interval for sons height data in inches
# load son height data
data(father.son); x <- father.son$sheight
# calculate confidence interval for sons height in inches
mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x))
## [1] 68.51605 68.85209

Confidence Interval - Bernoulli Distribution/Wald Interval

  • for Bernoulli distributions, \(X_i\) is 0 or 1 with success probability \(p\) and the variance is \(\sigma^2 = p(1 - p)\)
  • the confidence interval takes the form of \[\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}\]
  • since the population proportion \(p\) is unknown, we can use the sampled proportion of success \(\hat{p} = X/n\) as estimate
  • \(p(1-p)\) is largest when \(p = 1/2\), so 95% confidence interval can be calculated by \[\begin{aligned} \hat{p} \pm Z_{0.95} \sqrt{\frac{0.5(1-0.5)}{n}} & = \hat{p} \pm qnorm(.975) \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm 1.96 \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm \frac{1.96}{2} \sqrt{\frac{1}{n}}\\ & \approx \hat{p} \pm \frac{1}{\sqrt{n}}\\ \end{aligned}\]
    • this is known as the Wald Confidence Interval and is useful in roughly estimating confidence intervals
    • generally need \(n\) = 100 for 1 decimal place, 10,000 for 2, and 1,000,000 for 3
  • example
    • suppose a random sample of 100 likely voters, 56 intent to vote for you, can you secure a victory?
    • we can use the Wald interval to quickly estimate the 95% confidence interval
    • as we can see below, because the interval [0.46, 0.66] contains values below 50%, victory is not guaranteed
    • binom.test(k, n)$conf = returns confidence interval binomial distribution (collection of Bernoulli trial) with k successes in n draws
# define sample probability and size
p = 0.56; n = 100
# Wald interval
c("WaldInterval" = p + c(-1, 1) * 1/sqrt(n))
## WaldInterval1 WaldInterval2
##          0.46          0.66
# 95% confidence interval
c("95CI" = p + c(-1, 1) * qnorm(.975) * sqrt(p * (1-p)/n))
##     95CI1     95CI2
## 0.4627099 0.6572901
# perform binomial test
binom.test(p*100, n*100)$conf.int
## [1] 0.004232871 0.007265981
## attr(,"conf.level")
## [1] 0.95

Confidence Interval - Binomial Distribution/Agresti-Coull Interval

  • for a binomial distribution with smaller values of \(n\) (when \(n\) < 30, thus not large enough for CLT), often time the normal confidence intervals, as defined by \[\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}\] do not provide accurate estimates
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data
    phats <- rbinom(nosim, prob = p, size = n) / n
    # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(.75, 1.0)

  • as we can see from above, the interval do not provide adequate coverage as 95% confidence intervals (frequently only provide 80 to 90% coverage)
  • we can construct the Agresti-Coull Interval, which is defined uses the adjustment \[\hat{p} = \frac{X+2}{n+4}\] where we effectively add 2 to number of successes, \(X\), and add 2 to number of failure
  • therefore the interval becomes \[\frac{X+2}{n+4} \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}\]
  • Note: interval tend to be conservative
  • example
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data with Agresti/Coull Interval adjustment
    phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
        # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(.75, 1.0)

  • as we can see from above, the coverage is much better for the 95% interval
  • in fact, all of the estimates are more conservative as we previously discussed, indicating the Agresti-Coull intervals are wider than the regular confidence intervals

Confidence Interval - Poisson Interval

  • for \(X \sim Poisson(\lambda t)\)
    • estimate rate \(\hat{\lambda} = X/t\)
    • \(var(\hat{\lambda}) = \lambda/t\)
    • variance estimate \(= \hat{\lambda}/t\)
  • so the confidence interval is defined as \[\hat \lambda \pm z_{1-\alpha/2}\sqrt{\frac{\lambda}{t}}\]
    • however, for small values of \(\lambda\) (few events larger time interval), we should not use the asymptotic interval estimated
    • example
      • for this example, we will go through a specific scenario as well as a simulation exercise to demonstrate the ineffectiveness of asymptotic intervals for small values of \(\lambda\)
      • nuclear pump failed 5 times out of 94.32 days, give a 95% confidence interval for the failure rate per day?
      • poisson.test(x, T)$conf = returns Poisson 95% confidence interval for given x occurrence over T time period
# define parameters
x <- 5; t <- 94.32; lambda <- x / t
# calculate confidence interval
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
## [1] 0.007 0.099
# return accurate confidence interval from poisson.test
poisson.test(x, T = 94.32)$conf
## [1] 0.01721254 0.12371005
## attr(,"conf.level")
## [1] 0.95
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 100
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95)+ylim(0, 1.0)

  • as we can see above, for small values of \(\lambda = X/t\), the confidence interval produced by the asymptotic interval is not an accurate estimate of the actual 95% interval (not enough coverage)
  • however, as \(t \to \infty\), the interval becomes the true 95% interval
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 1000
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(0, 1.0)

  • as we can see from above, as \(t\) increases, the Poisson intervals become closer to the actual 95% confidence intervals

Confidence Intervals - T Distribution(Small Samples)

  • t confidence interval is defined as \[Estimate \pm TQ \times SE_{Estimate} = \bar X \pm \frac{t_{n-1} S}{\sqrt{n}}\]
    • \(TQ\) = quantile from T distribution
    • \(t_{n-1}\) = relevant quantile
    • \(t\) interval assumes data is IID normal so that \[\frac{\bar X - \mu}{S/\sqrt{n}}\] follows Gosset’s \(t\) distribution with \(n-1\) degrees of freedom
    • works well with data distributions that are roughly symmetric/mound shaped, and does not work with skewed distributions
      • skewed distribution \(\rightarrow\) meaningless to center interval around the mean \(\bar X\)
      • logs/median can be used instead
    • paired observations (multiple measurements from same subjects) can be analyzed by t interval of differences
    • as more data collected (large degrees of freedom), t interval \(\rightarrow\) z interval
    • qt(0.975, df=n-1) = calculate the relevant quantile using t distribution
# Plot normal vs t distributions
k <- 1000; xvals <- seq(-5, 5, length = k); df <- 10
d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)),x = xvals,
              dist = factor(rep(c("Normal", "T"), c(k,k))))
g <- ggplot(d, aes(x = x, y = y))
g <- g + geom_line(size = 2, aes(colour = dist)) + ggtitle("Normal vs T Distribution")
# plot normal vs t quantiles
d <- data.frame(n= qnorm(pvals),t=qt(pvals, df),p = pvals)
h <- ggplot(d, aes(x= n, y = t))
h <- h + geom_abline(size = 2, col = "lightblue")
h <- h + geom_line(size = 2, col = "black")
h <- h + geom_vline(xintercept = qnorm(0.975))
h <- h + geom_hline(yintercept = qt(0.975, df)) + ggtitle("Normal vs T Quantiles")
# plot 2 graphs together
grid.arrange(g, h, ncol = 2)

  • William Gosset’s t Distribution (“Student’s T distribution”)
    • test = Gosset’s pseudoname which he published under
    • indexed/defined by degrees of freedom, and becomes more like standard normal as degrees of freedom gets larger
    • thicker tails centered around 0, thus confidence interval = wider than Z interval (more mass concentrated away from the center)
    • for small sample size (value of n), normalizing the distribution by \(\frac{\bar X - \mu}{S/\sqrt{n}}\) \(\rightarrow\) t distribution, not the standard normal distribution
      • \(S\) = standard deviation may be inaccurate, as the std of the data sample may not be truly representative of the population std
      • using the Z interval here thus may produce an interval that is too narrow

Confidence Interval - Paired T Tests

  • compare observations for the same subjects over two different sets of data (i.e. different times, different treatments)
  • the confidence interval is defined by \[ \bar X_1 - \bar X_2 \pm \frac{t_{n-1} S}{\sqrt{n}}\] where \(\bar X_1\) represents the first observations and \(\bar X_2\) the second set of observations
  • t.test(difference) = performs group mean t test and returns metrics as results, which includes the confidence intervals
    • t.test(g2, g1, paired = TRUE) = performs the same paired t test with data directly
  • example
    • the data used here is for a study of the effects of two soporific drugs (increase in hours of sleep compared to control) on 10 patients
# load data
data(sleep)
# plot the first and second observations
g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID)))
g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size =10, pch = 21, fill = "salmon", alpha = .5)
g

# define groups
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
# define difference
difference <- g2 - g1
# calculate mean and sd of differences
mn <- mean(difference); s <- sd(difference); n <- 10
# calculate intervals manually
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
## [1] 0.7001142 2.4598858
# perform the same test to get confidence intervals
t.test(difference)
##
##  One Sample t-test
##
## data:  difference
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of x
##      1.58
t.test(g2, g1, paired = TRUE)
##
##  Paired t-test
##
## data:  g2 and g1
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of the differences
##                    1.58

Independent Group t Intervals - Same Variance

  • compare two groups in randomized trial (“A/B Testing”)
  • cannot use the paired t test because the groups are independent and may have different sample sizes
  • perform randomization to balance unobserved covariance that may otherwise affect the result
  • \(t\) confidence interval for \(\mu_y - \mu_x\) is defined as \[\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}\]
    • \(t_{n_x + n_y - 2, 1 - \alpha/2}\) = relevant quantile
    • \(n_x + n_y - 2\) = degrees of freedom
    • \(S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}\) = standard error
    • \(S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)\) = pooled variance estimator
      • this is effectively a weighted average between the two variances, such that different sample sizes are taken in to account
      • For equal sample sizes, \(n_x = n_y\), \(S_p^2 = \frac{S_x^2 + S_y^2}{2}\) (average of variance of two groups)
    • Note: this interval assumes constant variance across two groups; if variance is different, use the next interval

Independent Group t Intervals - Different Variance

  • confidence interval for \(\mu_y - \mu_x\) is defined as \[\bar Y - \bar X \pm t_{df} \times \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}\]
    • \(t_{df}\) = relevant quantile with df as defined below
    • Note: normalized statistic does not follow t distribution but can be approximated through the formula with df defined below \[df = \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)}\]
      • \(\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}\) = standard error
  • Comparing other kinds of data
    • binomial \(\rightarrow\) relative risk, risk difference, odds ratio
    • binomial \(\rightarrow\) Chi-squared test, normal approximations, exact tests
    • count \(\rightarrow\) Chi-squared test, exact tests
  • R commands
    • t Confidence Intervals
      • mean + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)
        • c(-1, 1) = plus and minus, \(\pm\)
    • Difference Intervals (all equivalent)
      • mean2 - mean1 + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)
        • n = number of paired observations
        • qt(0.975, n - 1) = relevant quantile for paired
        • qt(0.975, n\(_x\) + n\(_y\) - 2) = relevant quantile for independent
      • t.test(mean2 - mean1)
      • t.test(data2, data1, paired = TRUE, var.equal = TRUE)
        • paired = whether or not the two sets of data are paired (same subjects different observations for treatment) \(\rightarrow\) TRUE for paired, FALSE for independent
        • var.equal = whether or not the variance of the datasets should be treated as equal \(\rightarrow\) TRUE for same variance, FALSE for unequal variances
      • t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
        • relevel(factor, ref) = reorders the levels in the factor so that “ref” is changed to the first level \(\rightarrow\) doing this here is so that the second set of measurements come first (1, 2 \(\rightarrow\) 2, 1) in order to perform mean\(_2\) - mean\(_1\)
        • I(object) = prepend the class “AsIs” to the object
        • Note: I(relevel(group, 2)) = explanatory variable, must be factor and have two levels

Hypothesis Testing

  • Hypothesis testing = making decisions using data
    • null hypothesis (H\(_0\)) = status quo
    • assumed to be true \(\rightarrow\) statistical evidence required to reject it for alternative or “research” hypothesis (H\(_a\))
      • alternative hypothesis typically take form of >, < or \(\ne\)
    • Results
Truth Decide Result
\(H_0\) \(H_0\) Correctly accept null
\(H_0\) \(H_a\) Type I error
\(H_a\) \(H_a\) Correctly reject null
\(H_a\) \(H_0\) Type II error
  • \(\alpha\) = Type I error rate
    • probability of rejecting the null hypothesis when the hypothesis is correct
    • \(\alpha\) = 0.05 \(\rightarrow\) standard for hypothesis testing
    • Note: as Type I error rate increases, Type II error rate decreases and vice versa
  • for large samples (large n), use the Z Test for \(H_0:\mu = \mu_0\)
    • \(H_a\):
      • \(H_1: \mu < \mu_0\)
      • \(H_2: \mu \neq \mu_0\)
      • \(H_3: \mu > \mu_0\)
    • Test statistic \(TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}\)
    • Reject the null hypothesis \(H_0\) when
      • \(H_1: TS \leq Z_{\alpha}\) OR \(-Z_{1 - \alpha}\)
      • \(H_2: |TS| \geq Z_{1 - \alpha / 2}\)
      • \(H_3: TS \geq Z_{1 - \alpha}\)
    • Note: In case of \(\alpha\) = 0.05 (most common), \(Z_{1-\alpha}\) = 1.645 (95 percentile)
    • \(\alpha\) = low, so that when \(H_0\) is rejected, original model \(\rightarrow\) wrong or made an error (low probability)
  • For small samples (small n), use the T Test for \(H_0:\mu = \mu_0\)
    • \(H_a\):
      • \(H_1: \mu < \mu_0\)
      • \(H_2: \mu \neq \mu_0\)
      • \(H_3: \mu > \mu_0\)
    • Test statistic \(TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}\)
    • Reject the null hypothesis \(H_0\) when
      • \(H_1: TS \leq T_{\alpha}\) OR \(-T_{1 - \alpha}\)
      • \(H_2: |TS| \geq T_{1 - \alpha / 2}\)
      • \(H_3: TS \geq T_{1 - \alpha}\)
    • Note: In case of \(\alpha\) = 0.05 (most common), \(T_{1-\alpha}\) = qt(.95, df = n-1)
    • R commands for T test:
      • t.test(vector1 - vector2)
      • t.test(vector1, vector2, paired = TRUE)
        • alternative argument can be used to specify one-sided tests: less or greater
        • alternative default = two-sided
      • prints test statistic (t), degrees of freedom (df), p-value, 95% confidence interval, and mean of sample
        • confidence interval in units of data, and can be used to intepret the practical significance of the results
  • rejection region = region of TS values for which you reject \(H_0\)
  • power = probability of rejecting \(H_0\)
    • power is used to calculate sample size for experiments
  • two-sided tests \(\rightarrow\) \(H_a: \mu \neq \mu_0\)
    • reject \(H_0\) only if test statistic is too larger/small
    • for \(\alpha\) = 0.05, split equally to 2.5% for upper and 2.5% for lower tails
      • equivalent to \(|TS| \geq T_{1 - \alpha / 2}\)
      • example: for T test, qt(.975, df) and qt(.025, df)
    • Note: failing to reject one-sided test = fail to reject two-sided
  • tests vs confidence intervals
    • (\(1-\alpha\))% confidence interval for \(\mu\) = set of all possible values that fail to reject \(H_0\)
    • if (\(1-\alpha\))% confidence interval contains \(\mu_0\), fail to reject \(H_0\)
  • two-group intervals/test
    • Rejection rules the same
    • Test \(H_0\): \(\mu_1 = \mu_2\) \(\rightarrow\) \(\mu_1 - \mu_2 = 0\)
    • Test statistic: \[\frac{Estimate - H_0 Value}{SE_{Estimate}} = \frac{\bar X_1 - \bar X_2 - 0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\]
    • R Command
      • t.test(values ~ factor, paired = FALSE, var.equal = TRUE, data = data)
        • paired = FALSE = independent values
        • factor argument must have only two levels
  • p values
    • most common measure of statistical significance
    • p-value = probability under the null hypothesis of obtaining evidence as extreme or more than that of the obtained
      • Given that \(H_0\) is true, how likely is it to obtain the result (test statistic)?
    • attained significance level = smallest value for \(\alpha\) for which \(H_0\) is rejected \(\rightarrow\) equivalent to p-value
      • if p-value < \(\alpha\), reject \(H_0\)
      • for two-sided tests, double the p-values
    • if p-value is small, either \(H_0\) is true AND the obeserved is a rare event OR \(H_0\) is false
    • R Command
      • p-value = pt(statistic, df, lower.tail = FALSE)
        • lower.tail = FALSE = returns the probability of getting a value from the t distribution that is larger than the test statistic
      • Binomial (coin flips)
        • probability of getting x results out of n trials and event probability of p = pbinom(x, size = n, prob = p, lower.tail = FALSE)
        • two-sided interval (testing for \(\ne\)): find the smaller of two one-sided intervals (X < value, X > value), and double the result
        • Note: lower.tail = FALSE = strictly greater
      • Poisson
        • probability of getting x results given the rate r = ppois(x - 1, r, lower.tail = FALSE)
        • x - 1 is used here because the upper tail includes the specified number (since we want greater than x, we start at x - 1)
        • r = events that should occur given the rate (multiplied by 100 to yield an integer)
        • Note: lower.tail = FALSE = strictly greater

Power

  • Power = probability of rejecting the null hypothesis when it is false (the more power the better)
    • most often used in designing studies so that there’s a reasonable chance to detect the alternative hypothesis if the alternative hypothesis is true
  • \(\beta\) = probability of type II error = failing to reject the null hypothesis when it’s false
  • power = \(1 - \beta\)
  • example
    • \(H_0: \mu = 30 \to \bar X \sim N(\mu_0, \sigma^2/n)\)
    • \(H_a: \mu > 30 \to \bar X \sim N(\mu_a, \sigma^2/n)\)
    • Power: \[Power = P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a \right)\]
      • Note: the above function depends on value of \(\mu_a\)
      • Note: as \(\mu_a\) approaches 30, power approaches \(\alpha\)
    • assuming the sample mean is normally distributed, \(H_0\) is rejected when \(\frac{\bar X - 30}{\sigma /\sqrt{n}} > Z_{1-\alpha}\)
    • or, \(\bar X > 30 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}\)
  • R commands:
    • alpha = 0.05; z = qnorm(1-alpha) \(\rightarrow\) calculates \(Z_{1-\alpha}\)
    • pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FALSE) \(\rightarrow\) calculates the probability of getting a sample mean that is larger than \(Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}\) given that the population mean is \(\mu_a\)
      • Note: using mean = mu0 in the function would = \(\alpha\)
    • Power curve behavior
      • Power increases as \(mu_a\) increases \(\rightarrow\) we are more likely to detect the difference in \(mu_a\) and \(mu_0\)
      • Power increases as n increases \(\rightarrow\) with more data, more likely to detect any alternative \(mu_a\)
library(ggplot2)
mu0 = 30; mua = 32; sigma = 4; n = 16
alpha = 0.05
z = qnorm(1 - alpha)
nseq = c(8, 16, 32, 64, 128)
mu_a = seq(30, 35, by = 0.1)
power = sapply(nseq, function(n)
    pnorm(mu0 + z * sigma / sqrt(n), mean = mu_a, sd = sigma / sqrt(n),
          lower.tail = FALSE)
    )
colnames(power) <- paste("n", nseq, sep = "")
d <- data.frame(mu_a, power)
library(reshape2)
d2 <- melt(d, id.vars = "mu_a")
names(d2) <- c("mu_a", "n", "power")
g <- ggplot(d2,
            aes(x = mu_a, y = power, col = n)) + geom_line(size = 2)
g

  • Solving for Power
    • When testing \(H_a : \mu > \mu_0\) (or \(<\) or \(\neq\)) \[Power = 1 - \beta = P\left(\bar X > \mu_0 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ; \mu = \mu_a \right)\] where \(\bar X \sim N(\mu_a, \sigma^2 / n)\)
    • Unknowns = \(\mu_a\), \(\sigma\), \(n\), \(\beta\)
    • Knowns = \(\mu_0\), \(\alpha\)
    • Specify any 3 of the unknowns and you can solve for the remainder; most common are two cases
      1. Given power desired, mean to detect, variance that we can tolerate, find the n to produce desired power (designing experiment/trial)
      2. Given the size n of the sample, find the power that is achievable (finding the utility of experiment)
    • Note: for \(H_a: \mu \neq mu_0\), calculated one-sided power using \(z_{1-\alpha / 2}\); however, the power calculation here exclusdes the probability of getting a large TS in the opposite direction of the truth, but this is only applicable when \(\mu_a\) and \(\mu_0\) are close together
  • Power Behavior
    • Power increases as \(\alpha\) becomes larger
    • Power of one-sided test \(>\) power of associated two-sided test
    • Power increases as \(\mu_a\) gets further away from \(\mu_0\)
    • Power increases as n increases (sample mean has less variability)
    • Power increases as \(\sigma\) decreases (again less variability)
    • Power usually depends only \(\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}\), and not \(\mu_a\), \(\sigma\), and \(n\)
      • effect size = \(\frac{\mu_a - \mu_0}{\sigma}\) \(\rightarrow\) unit free, can be interpretted across settings
  • T-test Power
    • for Gossett’s T test, \[Power = P\left(\frac{\bar X - \mu_0}{S/\sqrt{n}} > t_{1-\alpha, n-1} ; \mu = \mu_a \right)\]
      • \(\frac{\bar X - \mu_0}{S/\sqrt{n}}\) does not follow a t distribution if the true mean is \(\mu_a\) and NOT \(\mu_0\) \(\rightarrow\) follows a non-central t distribution instead
    • power.t.test = evaluates the non-central t distribution and solves for a parameter given all others are specified
      • power.t.test(n = 16, delta = 0.5, sd = 1, type = "one.sample", alt = "one.sided")$power = calculates power with inputs of n, difference in means, and standard deviation
        • delta = argument for difference in means
        • Note: since effect size = delta/sd, as n, type, and alt are held constant, any distribution with the same effect size will have the same power
      • power.t.test(power = 0.8, delta = 0.5, sd = 1, type = "one.sample", alt = "one.sided")$n = calculates size n with inputs of power, difference in means, and standard deviation
        • Note: n should always be rounded up (ceiling)

Multiple Testing

  • Hypothesis testing/significant analysis commonly overused
  • correct for multiple testing to avoid false positives/conclusions (two key components)
    1. error measure
    2. correction
  • multiple testing is needed because of the increase in ubiquitous data collection technology and analysis
    • DNA sequencing machines
    • imaging patients in clinical studies
    • electronic medical records
    • individualized movement data (fitbit)

Type of Errors

Actual \(H_0\) = True Actual \(H_a\) = True Total
Conclude \(H_0\) = True (non-significant) \(U\) \(T\) \(m-R\)
Conclude \(H_a\) = True (significant) \(V\) \(S\) \(R\)
Total \(m_0\) \(m-m_0\) \(m\)
  • \(m_0\) = number of true null hypotheses, or cases where \(H_0\) = actually true (unknown)
  • \(m - m_0\) = number of true alternative hypotheses, or cases where \(H_a\) = actually true (unknown)
  • \(R\) = number of null hypotheses rejected, or cases where \(H_a\) = concluded to be true (measurable)
  • \(m - R\) = number of null hypotheses that failed to be rejected, or cases where \(H_0\) = concluded to be true (measurable)
  • \(V\) = Type I Error / false positives, concludes \(H_a\) = True when \(H_0\) = actually True
  • \(T\) = Type II Error / false negatives, concludes \(H_0\) = True when \(H_a\) = actually True
  • \(S\) = true positives, concludes \(H_a\) = True when \(H_a\) = actually True
  • \(U\) = true negatives, concludes \(H_0\) = True when \(H_0\) = actually True

Error Rates

  • false positive rate = rate at which false results are called significant \(E[\frac{V}{m_0}]\) \(\rightarrow\) average fraction of times that \(H_a\) is claimed to be true when \(H_0\) is actually true
    • Note: mathematically equal to type I error rate \(\rightarrow\) false positive rate is associated with a post-prior result, which is the expected number of false positives divided by the total number of hypotheses under the real combination of true and non-true null hypotheses (disregarding the “global null” hypothesis). Since the false positive rate is a parameter that is not controlled by the researcher, it cannot be identified with the significance level, which is what determines the type I error rate.
  • family wise error rate (FWER) = probability of at least one false positive \(Pr(V \ge 1)\)
  • false discovery rate (FDR) = rate at which claims of significance are false \(E[\frac{V}{R}]\)

  • controlling error rates (adjusting \(\alpha\))
    • false positive rate
      • if we call all \(P<\alpha\) significant (reject \(H_0\)), we are expected to get \(\alpha \times m\) false positives, where \(m\) = total number of hypothesis test performed
      • with high values of \(m\), false positive rate is very large as well
    • family-wise error rate (FWER)
      • controlling FWER = controlling the probability of even one false positive
      • bonferroni correction (oldest multiple testing correction)
        • for \(m\) tests, we want \(Pr(V \ge 1) < \alpha\)
        • calculate P-values normally, and deem them significant if and only if \(P < \alpha_{fewer} = \alpha / m\)
      • easy to calculate, but tend to be very conservative
    • false discovery rate (FDR)
      • most popular correction = controlling FDR
      • for \(m\) tests, we want \(E[\frac{V}{R}] \le \alpha\)
      • calculate P-values normally and sort some from smallest to largest \(\rightarrow\) \(P_{(1)},P_{(1)}, ... , P_{(m)}\)
      • deem the P-values significant if \(P_{(i)} \le \alpha \times \frac{i}{m}\)
      • easy to calculate, less conservative, but allows for more false positives and may behave strangely under dependence (related hypothesis tests/regression with different variables)
    • example
      • 10 P-values with \(\alpha = 0.20\)
      alt text
  • adjusting for p-values
    • Note: changing P-values will fundamentally change their properties but they can be used directly without adjusting \(/alpha\)
    • bonferroni (FWER)
      • \(P_i^{fewer} = max(mP_i, 1)\) \(\rightarrow\) since p cannot exceed value of 1
      • deem P-values significant if \(P_i^{fewer} < \alpha\)
      • similar to controlling FWER

Example

set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  x <- rnorm(20)
  # First 500 beta=0, last 500 beta=2
  if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)}
  # calculating p-values by using linear model; the [2, 4] coeff in result = pvalue
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
trueStatus <- rep(c("zero","not zero"),each=500)
table(pValues < 0.05, trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  476
##   TRUE       500   24
# Controls FWER
table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
##        trueStatus
##         not zero zero
##   FALSE       23  500
##   TRUE       477    0
# Controls FDR (Benjamin Hochberg)
table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  487
##   TRUE       500   13

Resample Inference

  • Bootstrap = useful tool for constructing confidence intervals and caclulating standard errors for difficult statistics
    • principle = if a statistic’s (i.e. median) sampling distribution is unknown, then use distribution defined by the data to approximate it
    • procedures
      1. simulate \(n\) observations with replacement from the observed data \(\rightarrow\) results in \(1\) simulated complete data set
      2. calculate desired statistic (i.e. median) for each simulated data set
      3. repeat the above steps \(B\) times, resulting in \(B\) simulated statistics
      4. these statistics are approximately drawn from the sampling distribution of the true statistic of \(n\) observations
      5. perform one of the following
        • plot a histogram
        • calculate standard deviation of the statistic to estimate its standard error
        • take quantiles (2.5th and 97.5th) as a confidence interval for the statistic (“bootstrap CI”)
    • example
      • Bootstrap procedure for calculating confidence interval for the median from a data set of \(n\) observations \(\rightarrow\) approximate sampling distribution
# load data
library(UsingR); data(father.son)
# observed dataset
x <- father.son$sheight
# number of simulated statistic
B <- 1000
# generate samples
resamples <- matrix(
    sample(x,               # sample to draw frome
           n * B,           # draw B datasets with n observations each
           replace = TRUE), # cannot draw n*B elements from x (has n elements) without replacement
    B, n)                   # arrange results into n x B matrix
                            # (every row = bootstrap sample with n observations)
# take median for each row/generated sample
medians <- apply(resamples, 1, median)
# estimated standard error of median
sd(medians)
## [1] 0.76595
# confidence interval of median
quantile(medians, c(.025, .975))
##     2.5%    97.5%
## 67.18292 70.16488
# histogram of bootstraped samples
hist(medians)

  • Note: better percentile bootstrap confidence interval = “bias corrected and accelerated interval” in bootstrap package

  • Permutation Tests
    • procedures
      • compare groups of data and test the null hypothesis that the distribution of the observations from each group = same
        • Note: if this is true, then group labels/divisions are irrelevant
      • permute the labels for the groups
      • recalculate the statistic
        • Mean difference in counts
        • Geometric means
        • T statistic
      • Calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed
    • variations

      Data type Statistic Test name
      Ranks rank sum rank sum test
      Binary hypergeometric prob Fisher’s exact test
      Raw data ordinary permutation test
      • Note: randomization tests are exactly permutation tests, with a different motivation
      • For matched data, one can randomize the signs
      • For ranks, this results in the signed rank test
      • Permutation strategies work for regression by permuting a regressor of interest
      • Permutation tests work very well in multivariate settings
    • example
      • we will compare groups B and C in this dataset for null hypothesis \(H_0:\) there are no difference between the groups

  • we will compare groups B and C in this dataset for null hypothesis \(H_0:\) there are no difference between the groups
# subset to only "B" and "C" groups
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
# values
y <- subdata$count
# labels
group <- as.character(subdata$spray)
# find mean difference between the groups
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
observedStat
## [1] 13.25
  • the observed difference between the groups is 13.25
  • now we changed the resample the lables for groups B and C
# create 10000 permutations of the data with the labels' changed
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
# find the number of permutations whose difference that is bigger than the observed
mean(permutations > observedStat)
## [1] 0
  • we created 1000 permutations from the observed dataset, and found no datasets with mean differences between groups B and C larger than the original data
  • therefore, p-value is very small and we can reject the null hypothesis with any reasonable \(\alpha\) levels
  • below is the plot for the null distribution/permutations

  • as we can see from the black line, the observed difference/statistic is very far from the mean \(\rightarrow\) likely 0 is not the true difference
    • with this information, formal confidence intervals can be constructed and p-values can be calculated

Miguel Angel Huerta

16 de octubre de 2018